CLAP With Me: Step by Step Semantic Search on Audio Sources

AJ Wallace • Location: Theater 5

Everyone has seen a demo of semantic search using vector embeddings, or maybe even implemented one. But these implementations are predicated on the idea of a text query finding similarities in text data. What about other forms of data? Maybe you’ve heard of CLIP, a machine learning approach developed by OpenAI that connects images to text. Introducing… CLAP (Contrastive Language–Audio Pre-training), an approach that brings audio and text data into a single multimodal embedding space, unlocking semantic search across audio data. In this talk, we’ll discuss the basics of CLAP – what it is and what it does. Then, we’ll build a small application that generates CLAP vector embeddings from audio files, indexes them into OpenSearch, and implements a semantic search query over the audio data. Let’s get CLAP-ing!
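
To give a feel for the pipeline the talk describes, here is a minimal sketch of the three steps: embedding audio with CLAP, indexing into OpenSearch, and querying with text. It assumes the Hugging Face transformers CLAP checkpoint laion/clap-htsat-unfused (which projects to 512-dimensional vectors), a local OpenSearch node with the k-NN plugin enabled, and librosa for audio loading; the index name and helper functions (index_audio, search) are illustrative, not the speaker's actual implementation.

# Sketch: CLAP embeddings -> OpenSearch k-NN index -> text-to-audio semantic search.
# Assumes: pip install transformers torch librosa opensearch-py, and an OpenSearch
# node on localhost:9200 with the k-NN plugin enabled.
import librosa
import torch
from transformers import ClapModel, ClapProcessor
from opensearchpy import OpenSearch

MODEL_ID = "laion/clap-htsat-unfused"  # assumed checkpoint; 512-dim projection
model = ClapModel.from_pretrained(MODEL_ID)
processor = ClapProcessor.from_pretrained(MODEL_ID)
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
INDEX = "audio-clips"  # illustrative index name

# 1. Create a k-NN index whose vector field matches the CLAP embedding size.
client.indices.create(
    index=INDEX,
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {"type": "knn_vector", "dimension": 512},
                "filename": {"type": "keyword"},
            }
        },
    },
)

# 2. Embed each audio file (CLAP expects 48 kHz mono audio) and index it.
def index_audio(path: str) -> None:
    waveform, sr = librosa.load(path, sr=48_000, mono=True)
    inputs = processor(audios=waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        vec = model.get_audio_features(**inputs)[0]
    client.index(index=INDEX, body={"embedding": vec.tolist(), "filename": path})

# 3. Embed the text query with the same model and run a k-NN search,
#    so text and audio are compared in the shared multimodal space.
def search(query: str, k: int = 5):
    inputs = processor(text=[query], return_tensors="pt")
    with torch.no_grad():
        qvec = model.get_text_features(**inputs)[0]
    body = {"size": k, "query": {"knn": {"embedding": {"vector": qvec.tolist(), "k": k}}}}
    hits = client.search(index=INDEX, body=body)["hits"]["hits"]
    return [(h["_source"]["filename"], h["_score"]) for h in hits]

# Example usage: index_audio("kick_drum.wav"), then search("punchy kick drum").

The key design point is that both modalities go through the same CLAP model, so a text query vector and an audio clip vector can be compared directly by OpenSearch's approximate nearest-neighbor search.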

AJ Wallace

Splice

I'm an accidental search engineer. Early on, I started as a full stack engineer, leaning more towards the front end. Then I started working at a healthcare company on the search team and fell in love with the back end, especially building and tweaking search algorithms to get the most relevant results. I currently work at Splice, leading search and personalization engineering, where we're taking on the difficult problem of increasing relevance for text queries on audio sources. In my spare time, I'm a singer-songwriter. I enjoy lifting weights, hanging out with my family, and nerding out on audio equipment and movies.